(54) Probabilistic diagnosis, in particular for embedded and remote applications 



(57) Disclosed is a diagnosis engine for diagnosing 
a device with a plurality of components. States of the 
components are assumed to be probabilistically inde- 
pendent, for computing the probability of any particular 
set of components being bad and all others being good. 
States of shared functions, applicable for testing the 
functionality of some components in the same way, are 
assumed to be probabilistically independent given com- 
ponent states, for computing the probability of any par- 
ticular set of shared functions being failed and another 
particular set of shared functions being passed given 
that a particular set of components are bad. States of 
tests applicable on the device are assumed to be prob- 
abilistically independent given component and shared 
function states, for computing the probability of any particular 
set of tests being failed and another particular set 
of tests being passed given that a particular 
set of components are bad, and the rest are good, and 
a particular set of shared functions are failed, and the 
rest are passed. The diagnosis engine receives test re- 
sults of a set of tests on the device where at least one 
test has failed, and a model giving the coverage of the 
tests on the components of the device and information 
describing probabilistic dependencies between the 
tests. The diagnosis engine comprises means for set- 
ting or specifying a number N of components which may 
be simultaneously bad, and computing means for com- 
puting the likelihood that each of subsets of the compo- 
nents with size less than or equal to N are the bad com- 
ponents, whereby the computation is substantially exact 
within floating point computation errors. 
Description 

BACKGROUND OF THE INVENTION 



[0001] [...] monitoring, detecting, and isolating failures in a system, and in particular [...]

[0002] "To diagnose" means to determine why a malfunctioning device is behaving incorrectly. More formally, [...]
a subset of a predetermined set of [...]. A diagnosis
must both explain the incorrect behavior and optimize some objective function, such as probability of correctness [...].
The need to diagnose is a common reason to measure or to test.
[0003] The diagnosis of an engineered device for the purpose of repair or process improvement shall now be regarded.
This is in contrast to, say, a distributed computer system containing software objects that may be created or
destroyed at any time. It is assumed that the device consists of a finite number of replaceable components. [...]
caused only by having one or more bad components. What shall be called herein [...]

[0004] Expert systems have been used for diagnosing computer failures, as described e.g. by J.A. Kavicky and G. [...],
ACM, 1989. Analysis of data from on-bus diagnosis hardware is described in Fitzgerald, G.L., "Enhance computer fault
isolation with a history memory," IEEE, 1980. Fault-tolerant computers have for many years been [...]
processing and memory elements, data pathways, and built-in monitoring capabilities for determining when to switch
off a failing unit and switch to a good, redundant unit (cf. e.g. US-A-5 099 485).

[0005] [...] model-based diagnostic systems. A model-based diagnostic system may be defined as a diagnostic system that renders
conclusions about the state of the SUT (system under test) using actual SUT responses from applied tests and [...]
computer-generated models of the SUT and its components and the diagnostic process.

[0006] Model-based diagnostic systems are known e.g. from W. Hamscher, L. Console, J. de Kleer in 'Readings in
system model-based diagnosis', Morgan Kauffman, 1992. A test-based system model is used by the Hewlett-Packard
HP Fault Detective and [...] Hewlett-Packard [...].

[0007] US-A-5,808,919 (Preist et al.) [...]
the modeling burden is greatly reduced. The model disclosed in Preist et al. employs a list of functional tests, a list of
components exercised by each functional test along with the degree to which each component is exercised by each
functional test, and the historical or estimated a priori failure rate for individual components.
[0008] US-A-5,922,079 (Booth et al.) discloses an automated analysis and troubleshooting system that identifies
potential problems with the test suite (ability of the model to detect and discriminate among potential faults) and
identifies probable modeling errors based on incorrect diagnoses.

[0009] EP-A-887,733 (Kanevsky et al.) discloses a model-based diagnostic system that provides automated tools [...]
based upon a manageable model of the device under test.

[...] engine can be used with applications where a failing device is to be debugged using a
pre-determined set of test and measurement equipment to perform tests from a pre-designed set of tests.

[...] executed on the SUT and the system model determined for the SUT, the engine
computes a list of fault candidates for the components of the SUT. Starting, e.g., from a priori failure probabilities of
the components, these probabilities may then be weighted with the model information according to whether a test passes or
fails. At least one test has to fail, otherwise the SUT is assumed to be good.

[0012] An embedded processor is a microprocessor or other digital computing circuit which is severely limited in
[...] size because it is embedded (i.e. built in to) another product. Examples of products
typically containing embedded processors include automobiles, trucks, major home appliances, and [...] computers
(which often contain an embedded maintenance processor in addition to the [...]). Embedded
processors typically have available several orders of magnitude less memory and an order of magnitude or
[...] power than a desktop personal computer. For example, a [...] of memory would be a large
[...]. It is desirable to enable such an embedded processor [...]
failures of the product. A diagnosis engine providing such a capability shall be called an embedded diagnosis engine.
[0013] It is possible to perform probabilistic diagnosis by various heuristic methods, as applied by the aforementioned
HP Fault Detective product or US-A-5,808,919 (Preist et al.). Heuristics by nature trade off some accuracy for 
reduced computation time. However, the HP Fault Detective typically requires 4 to 8 megabytes of memory. This 
can be a prohibitive amount for an embedded diagnosis engine. 

[0014] Another method for solving the problem is Monte Carlo simulation. Although the Monte Carlo simulation 
method can be made arbitrarily accurate (by increasing the number of simulations), the simulation results must be 
stored in a database that the diagnosis engine later reads. It has been shown that, even when stored in a space-efficient
binary format, this database requires 2-6 megabytes for typical applications. This is too much for an embedded 
application and would be a burden on a distributed application, where the database might have to be uploaded over a 
computer network for each diagnosis. 

[0015] A common way of building a probabilistic diagnostic system is to use a Bayesian network (cf. Finn V. Jensen: 
"Bayesian Networks", Springer Verlag, 1997). A Bayesian network is a directed acyclic graph. Each node in the graph 
represents a random variable. An edge in the graph represents a probabilistic dependence between two random var- 
iables. A source (a node with no in-edges) is independent of all the other random variables and is tagged with its a 
priori probability. A non-source node is tagged with tables that give probabilities for the value of the node's random 
variable conditioned on all of the random variables upon which it is dependent. 

[0016] The computation on Bayesian networks of most use in diagnosis is called belief revision. Suppose values of 
some of the random variables (here, the results of some tests) are observed. A belief revision 
algorithm computes the most likely probabilities for all the unobserved random variables given the observed ones. 
Belief revision is NP-hard (cf. M. R. Garey and D. S. Johnson: "Computers and Intractability: A guide to the theory of 
NP-completeness", W. H. Freeman and Co., 1979), and so all known algorithms have a worst-case computation time 
exponential in the number of random variables in the graph. 

[0017] Bayesian networks used for diagnosis are constructed with random variables and their dependencies repre- 
senting arbitrary cause-and-effect relationships among observables such as test results, unobservable state of the 
device under diagnosis and its components, and failure hypotheses. The graph can grow very large and have arbitrary 
topology. For example, an experimental Bayesian network used by Hewlett-Packard for printer diagnosis has over 
2,000 nodes. The complexity of such networks creates two difficulties: 

• all of the conditional probabilities for non-source nodes must be obtained or estimated, and 

• local changes to topology or conditional probabilities may have difficult-to-understand global effects on diagnostic 
accuracy. 

[0018] In other words, the use of a large Bayesian net of arbitrary topology for diagnosis has somewhat the same 
potential for supportability problems as do rule-based diagnostic systems. 

SUMMARY OF THE INVENTION 

[0019] It is an object of the invention to provide an improved probabilistic diagnosis, which may also be applicable 
for embedded and/or remote applications. The object is solved by the independent claims. Preferred embodiments are 
shown by the dependent claims. 

[0020] The present invention provides a diagnosis engine: a tool that provides automatic assistance e.g. to a tech- 
nician at each stage of a debugging process by identifying components which are most likely to have failed. 
[0021] The major advantage of the invention over other diagnosis engines is that it can be provided with a small 
memory footprint: both code and runtime memory requirements are small, growing only linearly with the model size. 
[0022] The invention is preferably written entirely in Java (cf. e.g. James Gosling, Bill Joy, and Guy Steele: The Java 
Language Specification, Addison Wesley, 1996) and preferably uses only a few classes from the Java standard lan- 
guage library packages. These features make the invention in particular well suited to embedded and distributed ap- 
plications. 

[0023] The invention is meant to be used on applications where a failing device is to be debugged using a predetermined
set of test and measurement equipment to perform tests from a pre-designed set of tests. For present 
purposes, a test is a procedure performed on a device. A test has a finite number of possible outcomes. Many tests have 
two outcomes: pass and fail. For example, a test for repairing a computer may involve checking to see if a power supply 
voltage is between 4.9 and 5.1 volts. If it is, the test passes. If it isn't, the test fails. Tests may have additional outcomes, 
called failure modes. For example, a test may involve trying to start an automobile. If it starts, the test passes. Failure 
modes might include: 

• the lights go dim when the key is turned, and there is no noise from under the hood, 
• the lights stay bright when the key is turned, and there is the noise of a single click, 

• the lights stay bright when the key is turned, there is a click, the starter motor turns, but the engine doesn't start.

[0024] The set of all tests available for debugging a particular device is called that device's test suite. Many applications
fit these definitions of debugging and of tests. Examples are: 

• computer and electronics service and manufacturing rework, 

• servicing products such as automobiles and home appliances, and 

• telephone support, if we broaden the idea of "test" to include obtaining answers to verbal questions. 
[0025] Given: 

• a set of test results on an object (e.g. Test 1 = pass, Test 2 = fail, Test 3 = pass, etc.) where at least one test has
failed, and

• a model giving the coverage of the tests on the components (e.g. field replaceable units) of the object and
information describing probabilistic dependencies between tests, 

the invention outputs a probabilistic diagnosis of the object, that is, a list, each element of which contains: 

• a list of one or more components, and 

• the likelihood that those components are the bad components. (Likelihood is un-normalized probability.
That is, probabilities must sum to one but likelihoods need not.) 
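
The relation between likelihood and probability stated above can be illustrated with a short Python sketch. It assumes, purely for illustration, that a diagnosis is held as a dictionary mapping each candidate set of components to its likelihood; the function name `normalize` and the example values are hypothetical and are not part of the described engine. Dividing by the sum over the listed candidates yields probabilities relative to that list only.

```python
def normalize(likelihoods):
    """Scale un-normalized likelihoods so that they sum to one.

    `likelihoods` maps a candidate (a tuple of component names) to a
    non-negative likelihood. The result is a probability only relative
    to the candidates that were actually enumerated.
    """
    total = sum(likelihoods.values())
    if total == 0.0:
        raise ValueError("all likelihoods are zero; nothing to normalize")
    return {candidate: value / total for candidate, value in likelihoods.items()}


# Hypothetical diagnosis with two single-component candidates.
diagnosis = {("power supply",): 0.072, ("cable",): 0.009}
probabilities = normalize(diagnosis)
```

The ranking of candidates is the same before and after normalization, which is why an engine can sort by likelihood directly.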

[0026] Most automated diagnosis systems provide simply a list of possible diagnoses without weighting by probability.
Having probabilities is particularly desirable in applications where the number of field replaceable units (FRUs) [...]
give technicians an opportunity to apply their own expertise.
[0027] The invention allows handling multiple component failures. No distinction is made between single and multiple 
failures.

[0028] The invention combines the model-based (cf. W. Hamscher, L. Console, and J. de Kleer: Readings in model-based
diagnosis, Morgan Kauffman, 1992) and probabilistic approaches to diagnostics. The invention uses the same
test-based model as used by the aforementioned HP Fault Detective or in US-A-5,808,919 (Preist et al.). This model
describes probabilistic relationships between tests and the components that they test in a manner intended to be
accessible to engineers who write tests. Features of this model preferably include: 

• a two-level part-whole hierarchy: names of components (field-replaceable units) and of their sub-components, 

• estimates of a priori failure probabilities of the components, 

• the names of the tests in the test suite, 

• an estimate of the coverage that each test has on each component, i.e., the proportion of the functionality of the 
component that is exercised by the test, or more formally, the conditional probability that the test will fail given that 
the component is bad, 

• shared coverages of tests, which [...]
of some components in exactly the same way (for example, two tests that access a certain component through a 
common cable have shared coverage on the cable), and 

• a way of specifying failure modes for tests in addition to pass and fail. Failure modes have a name and two lists 
of components or sub-components. The first list, called the acquit list, names the components or sub-components 
that must have some operable functionality in order for the failure mode to occur. The second list, called the indict 
list, names the components or sub-components that may be bad if the failure mode occurs. Each entry in the acquit 
and indict lists also contains an estimate of the amount of functionality of the component that the failure mode 
exercises. 
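
The model features listed above can be pictured as a small data structure. The following Python sketch is only one possible in-memory layout under stated assumptions: the class names `Model` and `FailureMode` and all field layouts are hypothetical, while the names `cov`, `sfcov` and `sfused` are borrowed from the notation used later in this description. It does not reproduce the .fdm file format.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    """A named test outcome other than pass/fail (hypothetical layout)."""
    name: str
    acquit: dict = field(default_factory=dict)  # component -> functionality exercised
    indict: dict = field(default_factory=dict)  # component -> functionality exercised


@dataclass
class Model:
    """Coverage-based model: priors, coverages and shared functions."""
    prior_bad: dict                                      # component -> a priori failure probability
    subcomponents: dict = field(default_factory=dict)    # component -> list of sub-component names
    cov: dict = field(default_factory=dict)              # test -> {component: coverage}
    sfcov: dict = field(default_factory=dict)            # shared function -> {component: coverage}
    sfused: dict = field(default_factory=dict)           # test -> set of shared functions
    failure_modes: dict = field(default_factory=dict)    # test -> list of FailureMode

    def tests(self):
        """Names of all tests mentioned anywhere in the model."""
        return sorted(set(self.cov) | set(self.sfused) | set(self.failure_modes))


# Two tests reach the board through the same cable, so both use one shared function.
model = Model(
    prior_bad={"board": 0.05, "cable": 0.01},
    cov={"t1": {"board": 0.6}, "t2": {"board": 0.3}},
    sfcov={"via_cable": {"cable": 0.9}},
    sfused={"t1": {"via_cable"}, "t2": {"via_cable"}},
)
```

Storing coverages as nested dictionaries keeps only nonzero entries, which matches the sparse iteration mentioned in the detailed description.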

[0029] Models can be created e.g.: 

• Using a model-building graphical user interface (GUI) that comes e.g. with the aforementioned HP Fault Detective. 
The HP Fault Detective model is read by a program that translates it into a simpler form used internally by the 
invention, which can be saved as an ASCII file. The invention can load such a file from a file system, from a URL, 
or from local memory. 

• By writing ASCII text Fault Detective Model (.fdm) files, or 

• Through a model creation application programming interface (API) in Java. 

[0030] The model, together with the rules of mathematical logic, makes it possible to compute the probability that a test will 
fail if a particular component is known to be bad. More details about these models and the model-building process are 
disclosed in the co-pending US patent application (Applicant's internal reference number: US 20-99-0042) by the same 
applicant and in US-A-5,922,079 (Booth et al.). The teaching of the former document with respect to the description 
of the model and the model-building process is incorporated herein by reference. 

[0031] The invention allows computing the probability of a test's failure when given any pattern of components known 
to be good or bad. The logic formula known as Bayes' Theorem allows running this computation in reverse: given a 
particular test result, the invention can calculate the probability of occurrence of some particular pattern of component 
faults and non-faults. The invention, then, enumerates all the possible patterns of component faults/non-faults, evalu- 
ating the probability of each pattern given the test result. The pattern with highest probability is selected as the diagnosis. 
[0032] Of course, one test is seldom sufficient to make an unambiguous diagnosis. If the test succeeds, it may clear 
some components, but not indicate the culprit. If it fails, it may indict several components, and other tests are required 
to clear some or focus suspicion on others. (Here, "clearing" means to knock the computed fault probability way down, 
and "focusing suspicion" means to raise the probability to the top or near the top.) Handling multiple test results is easy 
and quick if the tests are independent of each other. But if the tests are not independent, the problem is much more 
complex. The dependence is modeled by the shared functions. A case-by-case breakdown must be made of all the 
ways the shared functions might pass or fail and how they affect the joint probabilities of the test results. Then all these 
influences must be summed, as sketched e.g. in the outline of a diagnosis algorithm (in pseudo-code) as shown below: 

1. For each possible combination of bad components: 

(a) Set sum to 0. 

(b) For each possible pass/fail combination of shared functions: 

i. Compute the probability of the observed test results. 

ii. Add the probability to sum. 

(c) Calculate likelihood of the combination of bad components given sum (using Bayes' Theorem). 

2. Sort the fault likelihoods in descending order. 

[0033] The algorithm iterates over combinations of failed components and computes the conditional likelihood of 
each combination given passed and failed tests. 
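
The pseudo-code outline above can be sketched in Python as follows. This is a minimal illustration, not the engine itself: the function `diagnose` and its callback parameters `p_sf_state` and `p_results` are hypothetical names, and the weighting of each term by the probability of the shared function state is an assumption based on the law of total probability referred to in the detailed description.

```python
from itertools import combinations, product


def diagnose(components, shared_functions, prior_bad,
             p_sf_state, p_results, max_bad=1):
    """Rank candidate sets of bad components by un-normalized likelihood.

    prior_bad[c]            -- a priori probability that component c is bad
    p_sf_state(bad, failed) -- hypothetical callback: P(shared function state | bad set)
    p_results(bad, failed)  -- hypothetical callback: P(observed results | bad set, state)
    """
    ranked = []
    for size in range(1, max_bad + 1):
        for bad in combinations(components, size):          # step 1
            bad_set = frozenset(bad)
            total = 0.0                                      # step 1(a)
            # Step 1(b): every pass/fail combination of the shared functions.
            for bits in product((False, True), repeat=len(shared_functions)):
                failed = frozenset(s for s, f in zip(shared_functions, bits) if f)
                total += p_sf_state(bad_set, failed) * p_results(bad_set, failed)
            # Step 1(c): multiply by the prior of this component pattern.
            prior = 1.0
            for c in components:
                prior *= prior_bad[c] if c in bad_set else 1.0 - prior_bad[c]
            ranked.append((prior * total, bad))
    ranked.sort(key=lambda pair: pair[0], reverse=True)      # step 2
    return ranked
```

The inner loop visits two to the power of the number of shared functions states for every candidate, which makes the cost of the naive outline visible in the code.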

[0034] Clearly, this method can require enormous amounts of computation as it explores all combinations of shared 
function outcomes for all combinations of faults. 

[0035] The mathematical details of how this is accomplished, and of how the computational burden is reduced 
to make the method practical, are given in the section 'Detailed Description of the Invention'. 
[0036] Any model as used by the invention can be represented by a Bayesian network. The resulting graph is tripartite, 
consisting solely of sources, sinks, and one level of internal nodes (as shown later). There is one source for each 
component. There is one sink for each test. Each shared function is represented by one internal node. However, in 
order to represent test coverage information, the so-called "Noisy-or" (defined and described in detail in chapter 3 of 
Jensen) construction must be used. The form of test coverage [...]
(again, see Chapter 3 of Jensen) cannot be used. This means that the Bayesian network will [...] in the number of components
covered by any test. Even small models exhaust [...] PC workstations. Clearly this approach is not
well suited to embedded or distributed applications.

[0037] [...] belief revision over this subclass. The
high accuracy rate of successful diagnosis with the invention [...]
powerful to represent [...]. [...] in accordance with the present invention may be an advantage when
building and supporting a model when compared with free-form construction of a Bayesian network.

[0038] [...] of the invention is [...] in
practice. It runs about as fast as the aforementioned [...] diagnosis problems.

[0039] In a nutshell, the invention is based on the assumptions that:

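1. Component states (that is, whether each component is good or bad) are probabilistically independent;

2. Shared function states (that is, whether each shared function is passed or failed) are probabilistically independent
given component states; and

3. Test states (that is, whether each test is passed or failed) are probabilistically independent given component
and shared function states.

[0040] Assumption 1 is used to compute the probability of any particular set of components being bad and all others
being good.
[0041] Assumption 2 is used to compute the probability of any particular set of shared functions being failed and another
particular set of shared functions being passed given that a particular set of components are bad.
[0042] Assumption 3 is used to compute the probability of any particular set of tests being failed and another
particular set of tests being passed given that a particular set of components are bad (and the rest are good) and a
particular set of shared functions are failed (and the rest are passed).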
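[0043] Thus, the invention provides: 

1. Means of specifying [...].

2. [...]

3. Means of specifying how many components may be simultaneously bad. Call this number N. 

4. Means of computing the likelihood that each of the subsets of the components with size less than or equal to N
are the bad components, whereby

• the computation is exact (to within small floating point computation error); and 

• [...] amount that is the same independent of the model size [...].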

5. Means of outputting the likelihoods, either 

• in human-readable form, or 

• as computer data available for further automatic processing. 
[0044] [...]

[0045] The invention thus allows construction of a diagnosis engine which will require an amount of memory less
than the amount of memory required to store the model and the amount of memory required to store the output multiplied 
by a small factor which is a constant independent of the model and output sizes. This makes such a diagnosis engine 
well suited to use as an embedded diagnosis engine. 

[0046] It is clear that the invention can be partly or entirely embodied by one or more suitable software programs, 
which can be stored on or otherwise provided by any kind of data carrier, and which might be executed in or by any 
suitable data processing unit. 

DETAILED DESCRIPTION OF THE INVENTION 

[0047] Table 1 shows the notation for items from the coverage-based model, such as components, tests, and shared 
functions and gives formal definitions for coverage and shared function variability. 

[0048] For the sake of simplicity, the term "Equation" shall be used in the following not only for pure mathematical 
equations but also for mathematical terms to which it is referenced in this description. 

[0049] The components Φ are random variables that can have the states {good, bad}. The prior probabilities P(c 
is bad) for c ∈ Φ are given in the model. Component failures are assumed to be independent as defined in Equation (0.1). 
[0050] Shared functions are used to express probabilistic dependencies between tests. Shared functions may be 
thought of as expressing the fact that some functionality is checked by different tests in exactly the same way. The 
shared functions Ω are random variables with the states {pass, fail}. A shared function s is dependent on the states of 
the components Φ as shown in Equation (0.2), where the shared function coverages sfcov(s, c) are given in the model. 
The shared functions are conditionally independent of one another as given in Equation (1). 

[0051] Intuitively, a shared function fails if it has coverage on a bad spot in any component on which the shared 
function has coverage. Formally, the probability of a shared function s failing dependent on the states of all components 
is defined by Equations (2) and (3). 
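
Equations (2) and (3) are not reproduced in this text, so the following Python sketch rests on an assumption: that "coverage on a bad spot in any component" is combined in the usual noisy-or way, with each bad component independently letting the shared function pass with probability 1 - sfcov(s, c). The function name `p_shared_function_fails` and the dictionary layout are hypothetical.

```python
def p_shared_function_fails(sfcov_s, bad_components):
    """Probability that one shared function fails, given the bad components.

    sfcov_s is a sparse mapping {component: coverage in [0, 1]} holding only
    the nonzero coverages of this shared function (assumed layout).
    """
    p_pass = 1.0
    for component, coverage in sfcov_s.items():   # sparse iteration
        if component in bad_components:
            p_pass *= 1.0 - coverage
    return 1.0 - p_pass


# Hypothetical shared function with coverage on a cable and a CPU.
sfcov_bus = {"cable": 0.5, "cpu": 0.2}
p_one_bad = p_shared_function_fails(sfcov_bus, {"cable"})
p_two_bad = p_shared_function_fails(sfcov_bus, {"cable", "cpu"})
```

With no bad component under coverage the product stays at 1 and the failure probability is 0, which agrees with the intuitive statement above.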

[0052] Equation (3) means that the computation can be performed while iterating through a sparse representation 
of sfcov. Each shared function s has a variability, sfvar(s), between 0 and 1. Intuitively, the variability of a shared function 
indicates how correlated the failures of the tests that use the shared function are when the shared function is failed. A 
variability of 0 means that all tests that use the shared function will fail if the shared function fails. A variability of 1 
means that tests that use the shared function may fail independently of each other if the shared function fails. In this 
case, the shared function is being used as a modeling convenience. The notion of shared function variability will be 
formalized below. 

[0053] The tests Ψ are random variables with the states {pass, fail}. Generally, only some of the tests are performed. 
Let π be the passed tests. Let φ be the failed tests. A test is dependent on both components and shared functions. 
The coverages P are defined by Equation (3.1) and given in the model. The shared functions used by a test, sfused(t) 
⊆ Ω, are also given in the model. Tests are conditionally independent of one another given the states of all components 
and shared functions as shown in Equation (4). 

[0054] If a test uses no shared functions, its probability of failure depends on the component states. Intuitively, a test 
fails if it has coverage on a bad spot. Formally, the probability of a test t failing, when t uses no shared functions, 
dependent on the states of all components is defined by Equations (5) and (6). Equation (6) means that the computation 
can be performed by iterating through a sparse representation of cov. 
[0055] When a test uses shared functions, it can also fail if any of those shared functions fail. Let's assume Equation 
(6.1). The conditional probability of test success is then given in Equation (7) and the conditional probability of test 
failure is its complement as shown by Equation (8). 
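
Because Equations (5) through (8) are not reproduced in this text, the following Python sketch shows only one plausible reading. It assumes that a test passes with probability equal to the product of 1 - cov(t, c) over the bad components, multiplied by sfvar(s) for every failed shared function the test uses. This choice is consistent with the limiting cases described for variability (0 forces failure, 1 leaves the test unaffected) but is an assumption; the function names are hypothetical.

```python
def p_test_passes(cov_t, sfused_t, sfvar, bad_components, failed_sfs):
    """Assumed probability that test t passes, given component and
    shared function states.

    cov_t    -- sparse {component: coverage} for this test
    sfused_t -- shared functions used by this test
    sfvar    -- {shared function: variability in [0, 1]}
    """
    p = 1.0
    for component, coverage in cov_t.items():     # sparse iteration over cov
        if component in bad_components:
            p *= 1.0 - coverage
    for s in sfused_t:
        if s in failed_sfs:
            p *= sfvar[s]                         # variability 0 forces failure
    return p


def p_test_fails(cov_t, sfused_t, sfvar, bad_components, failed_sfs):
    """Complement of the pass probability."""
    return 1.0 - p_test_passes(cov_t, sfused_t, sfvar, bad_components, failed_sfs)
```

For example, a test with coverage 0.3 on a bad power supply passes with probability 0.7 when its shared function passes, and with probability 0 when a shared function of variability 0 fails.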

[0056] All probabilistic dependencies between the three sets of random variables Φ, Ω, and Ψ are given in the aforementioned
Equations. Otherwise the random variables are independent. Thus, the dependencies among the random 
variables could be represented by a Bayesian Network where the sources of the directed acyclic graph (DAG) are 
the components Φ and the sinks are the tests Ψ. Each nonzero entry in cov, say cov(t, c), results in an edge from 
component node c to test node t. Each nonzero entry in sfcov, say sfcov(s, c), results in an edge from component node 
c to shared function node s. For each element s ∈ sfused(t) there is an edge from shared function node s to test node t. 
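
The edge construction just described can be written down directly. The Python sketch below assumes the nested-dictionary layout used for illustration in this description (test or shared function mapped to a sparse component-to-coverage dictionary); the function name `bayesian_network_edges` and the node tagging are hypothetical.

```python
def bayesian_network_edges(cov, sfcov, sfused):
    """List the directed edges of the three-layer network.

    cov    -- {test: {component: coverage}}
    sfcov  -- {shared function: {component: coverage}}
    sfused -- {test: set of shared functions}
    """
    edges = []
    for test, row in cov.items():
        for component, coverage in row.items():
            if coverage != 0:                      # only nonzero entries
                edges.append((("component", component), ("test", test)))
    for sf, row in sfcov.items():
        for component, coverage in row.items():
            if coverage != 0:
                edges.append((("component", component), ("shared function", sf)))
    for test, used in sfused.items():
        for sf in sorted(used):
            edges.append((("shared function", sf), ("test", test)))
    return edges


edges = bayesian_network_edges(
    cov={"t1": {"board": 0.6, "fan": 0.0}},
    sfcov={"via_cable": {"cable": 0.9}},
    sfused={"t1": {"via_cable"}},
)
```

Every edge runs from a component to a shared function or test, or from a shared function to a test, so the graph has exactly the three layers described.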
[0057] Given the above definitions, it is now possible to give the diagnosis algorithm according to the invention. The 
algorithm is simply to compute and sort the a posteriori likelihoods of component configurations given test results. Let π ⊆ 
Ψ be the passed tests. Let φ ⊆ Ψ be the failed tests. Bayes' Rule gives Equation (9). 

[0058] All of these conditional probabilities will be normalized by the same quantity P(π, φ). This quantity is the prior 
probability of the test results and is difficult to compute. So the invention uses the likelihood of Equation (10). 
[0059] The only nontrivial quantity to compute is P(π, φ | C, C̄). If there are no shared functions, this is easy and 
leads to Equation (11), where P(π | C, C̄), the probability of the passed tests given the component states, is given in Equations 
(12)-(14), and P(φ | C, C̄), the probability of the failed tests given the component states, is given in Equations (15) and (16). 
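
For the case without shared functions, the likelihood can be sketched in a few lines of Python. The sketch assumes the noisy-or reading of coverage (a test passes with probability equal to the product of 1 - cov(t, c) over the bad components) and multiplies the prior of the component pattern by the probabilities of the passed and failed tests. The function name and argument layout are hypothetical, and the equations themselves are not reproduced in this text.

```python
def likelihood_without_shared_functions(prior_bad, cov, bad, passed, failed):
    """Un-normalized likelihood of `bad` being exactly the bad components.

    prior_bad -- {component: a priori failure probability}
    cov       -- {test: {component: coverage}}
    """
    # Prior of the pattern: bad components bad, all others good.
    prior = 1.0
    for component, p in prior_bad.items():
        prior *= p if component in bad else 1.0 - p

    def p_pass(test):
        q = 1.0
        for component, coverage in cov.get(test, {}).items():
            if component in bad:
                q *= 1.0 - coverage
        return q

    p_passed = 1.0
    for test in passed:
        p_passed *= p_pass(test)
    p_failed = 1.0
    for test in failed:
        p_failed *= 1.0 - p_pass(test)
    return prior * p_passed * p_failed
```

A candidate that cannot explain a failed test (no coverage of that test on any of its components) receives likelihood zero, because the failure factor for that test vanishes.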
[0060] If there are shared functions, then use the law of total probability of Equation (17), where the first factor in the 
summand is in turn a product of factors computed according to Equations (7) and (8) as given in Equation (18). 
[0061] The conditional probabilities of the shared function states are computed exactly like the test result probabilities 
of Equation (11) as given in Equations (19)-(21). 

Improving Computation Time 

[0062] Diagnosis could be performed by straightforward evaluation of Equations (10) through (17) for each possible 
state of the components and shared functions. However, that approach would take time exponential in the number of
components and shared functions, which is clearly unacceptable for most practical applications. According to the invention, techniques for reducing 
the computation time can be applied, the most important of which are: 

• reducing the number of candidate diagnoses, i.e., of component states (C, C̄) for which the posteriori likelihood of Equation
(10) is computed, and 

• reducing the time required to evaluate Equation (17) by eliminating states of the shared function power set which 
do not affect the sum. 

a) Reducing the number of candidate diagnoses 

[0063] First, let's consider heuristics for reducing the number of component states. This can be achieved by making 
a reasonable assumption concerning the maximum number of simultaneously failed components. The invention assumes
that component failures are independent. So unless the prior probabilities of failure are large, multiple failures 
are rare. This observation suggests choosing a maximum number of simultaneous failures N and computing Equation 
(10) only for those C ⊆ Φ with 1 ≤ |C| ≤ N. This is the strategy preferably used by the invention. 
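
The effect of bounding the number of simultaneous failures can be shown with a short Python sketch. The function names `candidate_diagnoses` and `candidate_count` are hypothetical; the sketch only enumerates and counts the subsets of size 1 to N and does not compute any likelihoods.

```python
from itertools import combinations
from math import comb


def candidate_diagnoses(components, max_bad):
    """Yield every subset of components with size between 1 and max_bad."""
    for size in range(1, max_bad + 1):
        yield from combinations(components, size)


def candidate_count(num_components, max_bad):
    """Number of candidates without enumerating them."""
    return sum(comb(num_components, size) for size in range(1, max_bad + 1))


candidates = list(candidate_diagnoses(["a", "b", "c"], 2))
```

For a hypothetical model of 20 components, limiting N to 2 gives 210 candidates, compared with 2^20 - 1 = 1,048,575 nonempty subsets when no limit is imposed.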
[0064] Another strategy, used by the aforementioned HP Fault Detective, is based on Occam's Razor: postulate 
only as many failed components as necessary. In other words, take N = 1 and compute the likelihoods. If any likelihood 
is nonzero, stop. Otherwise increase N by one and repeat. This way, a set of diagnoses is found with the minimum 
cardinality necessary to explain the test results. There are two dangers to this approach: 

1. In pathological situations where unmodeled dependencies exist between tests, the algorithm may not stop in a 
reasonable amount of time. This can occur for example when a test fixture is set up incorrectly. 

2. The Bayesian algorithm produces a nonzero likelihood for a diagnosis if it has any chance whatsoever. A likelihood 
threshold would have to be set, but it is hard to set when the hard-to-determine denominator is being omitted from 
Equation (9). 

[0065] This strategy works well with the HP Fault Detective but does not work well with the invention because the 
invention can find candidate diagnoses with extremely small likelihoods. Even when |C| = 1, the invention will find some 
diagnoses with small likelihoods, for example 10^-50 or even 10^-100. 
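
The behaviour of the Occam's Razor strategy with an exact likelihood computation can be illustrated in Python. The sketch below is not the HP Fault Detective algorithm; `occam_search` and its callback `likelihoods_for_size` are hypothetical names, and the likelihood values are invented solely to show how a tiny but nonzero likelihood stops the search at N = 1.

```python
def occam_search(likelihoods_for_size, max_size, threshold=0.0):
    """Increase N until some candidate exceeds the likelihood threshold.

    likelihoods_for_size(n) -- hypothetical callback returning a list of
                               (candidate, likelihood) pairs for size n
    Returns (n, hits) with hits sorted by descending likelihood.
    """
    for n in range(1, max_size + 1):
        hits = [(lk, cand) for cand, lk in likelihoods_for_size(n) if lk > threshold]
        if hits:
            hits.sort(reverse=True)
            return n, hits
    return None, []


def invented_likelihoods(n):
    # A negligible single-fault explanation and a strong double-fault one.
    if n == 1:
        return [(("a",), 1e-50)]
    return [(("a", "b"), 0.3)]


stopped_at, found = occam_search(invented_likelihoods, max_size=2)
```

With the default threshold of zero the search stops at N = 1 on the 10^-50 candidate; only with a suitably chosen threshold does it proceed to the double fault, and choosing that threshold is exactly the difficulty noted above.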

b) Active shared functions 

[0066] Now let's consider the problem of reducing the size of the power set of Ω over which Equation (17) is summed. 
It is evident that a shared function plays a role in diagnosing only those components over which it has coverage and 
only when at least one conducted test makes use of the shared function. Therefore, Equation (17) may be summed over 
the much smaller power set of Equation (21.1), which is built from the active shared function set defined in Equation (22); 
that set in turn uses the provisional active shared function set defined in Equation (22.1). 
[0067] The restriction to the provisional active shared function set is justified because the states of the power set of Ω can be paired relative to any 
shared function s, so that the members of each pair are identical except for s passing in one and failing in the other. 
If s is not used by any test in (π ∪ φ), then Equation (22.2) is invariant for the pair, and the sum of Equation (22.3) 
[...] of state, so summing [...].
[0068] As for the restriction to the active shared function set, consider Equation (20). If a shared function s in the provisional set has no coverage on any presumed 
faulty component c ∈ C, then sfcov(s, c) is uniformly zero, implying that the innermost product in Equation (20) is 1 
for that s. This forces a factor of zero in the outermost product, yielding Equation (22.4). That result backs through 
Equation (19) into Equation (17), making the whole term zero. Thus, no state term need be evaluated which posits 
failure for such a shared function. And again, if a state posits that the shared function succeeds, it will simply cause a 
"1" to be factored into the product of Equation (21). So there is no reason to include that shared function in the state 
over which Equation (17) is summed. 

[0069] The provisional active shared function set can be quickly computed once at the beginning of the diagnosis, 
since it depends only on the tests which have been conducted. If the conducted tests are few relative to the available 
tests, this can effect a considerable reduction in the number of shared functions under consideration. The active shared 
function set is winnowed from this separately for each combination of faulty components to be evaluated. Limiting 
the number of simultaneous faults to be considered (cf. above) usually produces a major reduction in the size of this set. 

[0070] Some examples with the number of active shared functions for different models are shown in Table 2. The 
first four columns of the table give the name of the model, and the number of components, tests, and shared functions 
in the model. Column 5 shows the maximum number of active shared functions for which the state has been observed 
to be expanded. That many active shared functions are not always encountered. Column 7 gives the average size of 
the power set over which Equation (17) is expanded, which is the average of 

2^(#active SFs) 

[0071] This is the computational time factor paid for handling the shared functions. Column 6 is the base-2 log of 
column 7, giving the effective "average" number of active shared functions. The Boise data set is for a disk drive 
controller board, and Lynx3 is a board from a PC. The observed figures for them were derived over many runs of actual 
test results, and the effective average SF figures were almost always the same to the second decimal. Cofxhfdf is a 
model of a spectrum monitoring system, a small building full of radios, measurement equipment, and the cabling 
between them. The figures in the table were derived by arbitrarily having the first 30 tests fail and the next 30 tests pass. 
This is an artificial test result, but such large numbers of test failures do occur for the spectrum monitoring system. The 
result is encouraging, for the expansion factor of 5.85 is nowhere near 2^203. Running that diagnosis took 7.9 seconds 
of real time on a 200 MHz Pentium Pro computer, which includes the time for loading and starting the program, and 
reading in the model. The program is written in Java. 
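
The relation between columns 6 and 7 can be checked numerically. The following Java sketch is an illustration only: the class and method names are hypothetical and are not taken from the program described above. It averages 2^(#active SFs) over a list of per-diagnosis active shared function counts and takes the base-2 logarithm of that average.

```java
final class ExpansionFactor {
    /** Average power set size: the mean of 2^k over the given active SF counts k. */
    static double averagePowerSetSize(int[] activeSfCounts) {
        double sum = 0.0;
        for (int k : activeSfCounts) {
            sum += Math.pow(2.0, k);
        }
        return sum / activeSfCounts.length;
    }

    /** Effective "average" number of active SFs: base-2 log of the average power set size. */
    static double effectiveAverageSfs(double avgPowerSetSize) {
        return Math.log(avgPowerSetSize) / Math.log(2.0);
    }
}
```

Applied to the expansion factor of 5.85 quoted above, effectiveAverageSfs returns roughly 2.55, which is how an average power set size translates into an effective number of active shared functions.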

c) Short Circuiting 

[0072] The first product of Equation (18) can be zero if a passed test with 100% coverage clears an assumed bad 
component. It is actually bad form for a model to claim 100% coverage, so it may not be worthwhile to check for this. 
A more interesting case is that a term of the second product is zero. This means that no assumed-bad component 
could have caused one of the failed tests to fail. It is worth checking for this condition to avoid needless processing. 
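
A check of this kind could be sketched in Java as follows. This is a minimal illustration under assumptions that are not part of the description: the model is held in dense arrays cov[t][c] and sfcov[s][c], sfUsed[t] lists the shared functions used by test t, and all names are hypothetical. A failed test is treated as explainable if an assumed-bad component covers it directly or covers a shared function that the test uses.

```java
final class ShortCircuit {
    /**
     * Returns true if some failed test cannot have been caused by any assumed-bad
     * component, in which case the likelihood of this component set is zero.
     */
    static boolean hasUnexplainableFailure(double[][] cov, double[][] sfcov, int[][] sfUsed,
                                           int[] failedTests, int[] badComponents) {
        for (int t : failedTests) {
            if (!canFail(cov, sfcov, sfUsed, t, badComponents)) {
                return true;
            }
        }
        return false;
    }

    /** True if a bad component covers test t directly or through a shared function used by t. */
    private static boolean canFail(double[][] cov, double[][] sfcov, int[][] sfUsed,
                                   int t, int[] badComponents) {
        for (int c : badComponents) {
            if (cov[t][c] > 0.0) {
                return true;
            }
            for (int s : sfUsed[t]) {
                if (sfcov[s][c] > 0.0) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

When the method returns true, the enumeration of shared function states for that component set can be skipped entirely.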

d) Factoring 

[0073] In computing the sum of Equation (17), its first factor expands according to Equation (18), of which the first 
product is computed according to Equation (7). This in turn contains the factor of Equation (22.5), which is invariant 
over all the terms of the sum, and can therefore be pulled out of the loop. 
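
As an illustration of pulling that factor out of the loop, the following Java sketch computes the product of (1 - cov(t, c)) over the assumed-bad components once per test, and reuses it for every shared function state. The dense array layout and all names are assumptions made for the example.

```java
final class FactoringSketch {
    /** Factor of Equation (22.5) for every test t: product over bad components c of (1 - cov(t, c)). */
    static double[] directEscape(double[][] cov, int[] badComponents) {
        double[] out = new double[cov.length];
        for (int t = 0; t < cov.length; t++) {
            double p = 1.0;
            for (int c : badComponents) {
                p *= 1.0 - cov[t][c];
            }
            out[t] = p;
        }
        return out;
    }

    /**
     * Pass probability of one test as in Equation (7), reusing the precomputed factor.
     * Only the shared function part is recomputed for each shared function state.
     */
    static double pPass(double directEscapeOfT, double[] sfprob, int[] sfUsedByT, boolean[] sfFailed) {
        double p = directEscapeOfT;
        for (int s : sfUsedByT) {
            if (sfFailed[s]) {
                p *= 1.0 - sfprob[s];
            }
        }
        return p;
    }
}
```

Here directEscape would be called once per component set, before the loop over shared function states, and pPass once per test inside that loop.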

e) Miscellaneous 

[0074] The above speedups reduce the order of complexity of the algorithm. Other programming techniques also 
serve to reduce the required processing time. For example, coverages of failed tests must be matched against failed 
components. This goes faster if components, tests, and coverages are kept sorted. Bitmaps can be used to compute 
set intersections or unions, as for winnowing out the active shared function set. But the active shared function set 
should be kept in a squeezed representation for enumerating all of its states. It is well to pre-allocate arrays where 
possible, to avoid allocating them and freeing them during execution. 
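
The bitmap and squeezed representations might be combined as in the following Java sketch, which uses the standard java.util.BitSet class. The class and method names are hypothetical, and the sketch assumes that a bitmap of the shared functions covered by the assumed-bad components is already available.

```java
import java.util.BitSet;

final class ActiveSfBitmaps {
    /** Intersects two bitmaps of shared function indices without modifying the inputs. */
    static BitSet winnow(BitSet provisionalActive, BitSet coveredByBadComponents) {
        BitSet active = (BitSet) provisionalActive.clone();
        active.and(coveredByBadComponents);
        return active;
    }

    /** Squeezes a bitmap into a dense array holding the set indices in ascending order. */
    static int[] squeeze(BitSet bits) {
        int[] out = new int[bits.cardinality()];
        int n = 0;
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[n++] = i;
        }
        return out;
    }
}
```

The squeezed array is what allows bit i of a state pattern to be mapped directly to the i-th active shared function when all states are enumerated.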

Conclusions 

[0075] It will be apparent to those skilled in the art from the detailed description and the following procedures that a 
diagnosis engine constructed according to the present invention will require an amount of memory less than the amount 
of memory required to store the model and the amount of memory required to store the output multiplied by a small 
factor which is a constant independent of the model and output sizes. This makes such a diagnosis engine well suited 
to use as an embedded diagnosis engine. 
[0076] The low memory consumption comes from using equations that exactly compute 
the conditional likelihood of the component failures given the test results. Using the above equation numbering, this 
would be Equation 10, which contains values that must be computed from Equation 17 (which in turn uses Equation 
18 which in turn uses Equations 19, 20, and 21) and the independence equation (0.1). However, it is clear that the 
content of those equations can be expressed also by other equations without departing from the scope of the present 
invention. 

[0077] The invention requires little memory to compute the diagnosis in addition to that needed to store 
the model and the output. To illustrate this, the additional memory needed to compute the diagnosis is identified 
below. The low memory consumption comes from the features that use that additional memory. 
[0078] Computing the values of the left-hand sides of Equations 10 and 17-21 from the right-hand sides does 
not require the storage of any intermediate results other than: 

• a floating point register or memory location to accumulate the sum in Equation 17, 

• a floating point register or memory location to accumulate the products in Equation 18, 

• two floating point registers or memory locations to accumulate the products in Equation 20, 

• one floating point register or memory location to accumulate the products in Equation 21. 

[0079] That means that a minimum of 4 floating point registers or 4 memory locations are needed for calculating 
Equations 10 and 17-21. 

[0080] In order to improve computation time, the Active Shared Functions (Equation 22) can be applied. This, 
however, increases the amount of memory needed in order to compute and store the Active Shared Functions. There are 
two objects that must be stored in memory: the Provisional Active Shared Functions and the Active Shared Functions. 
[0081] The Provisional Active Shared Functions are normally computed once, typically before the computation is 
started. The Provisional Active Shared Functions are a subset of the Shared Functions. One way to store the Provisional 
Active Shared Functions is as an array of integers, where each integer in the array gives the index of a Shared Function 
that is a Provisional Active Shared Function. So, the Provisional Active Shared Functions can be stored in p integer 
sized memory locations, where p is the number of Provisional Active Shared Functions, which is no greater than the 
number of Shared Functions. The Active Shared Functions change during the course of the diagnosis computation. However, 
there is only one set of Active Shared Functions at any one time. The Active Shared Functions are a subset of the 
Provisional Active Shared Functions. So, the Active Shared Functions can be stored in no more integer memory 
locations than the number of Shared Functions. 

[0082] In a nutshell, this means that the small memory consumption comes from the direct and exact 
evaluation of statistical equations, such as Equations 10 and 17-21. Computing the values of the left-hand sides of these 
equations requires only a few floating point registers/memories. In order for this evaluation to be performed more 
efficiently, Provisional Active Shared Function and Active Shared Function sets can also be computed. These sets 
each require no more integer memory locations than the number of Shared Functions in the model. Thus the number 
of temporary memory locations needed to compute the diagnosis grows linearly with the number of Shared Functions. 
To obtain the overall memory requirement, memory to store the model and the output of the diagnosis must also be 
added. 



Illustrative example 

[0083] The effect of the invention can also be explained in a more descriptive way. Among searching methods that 
seek the best of a large number of combinations, there are two principal variants: those that search depth-first and 
those that search breadth-first. As a more pictorial example, it shall be assumed that the largest apple on the tree is 
to be found. 

[0084] For the depth-first search, one will go up the trunk to the first branch and follow that branch. When that 
branch divides, one will follow the larger subbranch, and so on, each time following the larger subbranch. One will 
eventually come to an apple, or to a leaf, or an end of a twig. If it is an apple, its position and size are jotted down on 
a slate. Then one will go back down to the base of that last branch and explore up the alternative branch. If one 
ever finds an apple that is bigger than the one noted on the slate, the slate is erased and the position and size of 
the new apple are noted. Eventually, the whole tree will have been explored, and the slate never had to record more 
than one apple's position and size. One has to keep track of where one has been, but that doesn't take too much 
memory. All that is required is a list of the form: 1st layer of branches: 3rd alternative; 2nd layer: 2nd alternative; and 
so on. If the tree has no more than ten layers of branches upon branches upon branches, one will only have to keep 
a list of ten entries. 

[0085] It is clear that this procedure requires a certain amount of time. Most likely, the biggest apple is on or near 
one of the bigger, low-hanging branches. To exploit this, one will do a breadth-first search. The size of the first layer 
of branches is surveyed. One starts by looking for apples on the largest branch, and follows the branches looking 
for apples. But if one ever gets so far out in the bushiness that the branch one is on is smaller than some other in the 
survey, a note of the present location is made and that other branch is explored. This way, one is always exploring 
the largest known previously-unexamined branch. A note is kept of the biggest apple so far. If one ever comes to a 
branch which is too small to support an apple of that size, that branch need not be explored any farther, nor any of 
its subtwigs. This builds a fast-growing list of branches which one might need to come back to, but the reward is 
that one always looks in the most likely places. 

[0086] The invention thus minimizes the amount of storage required because it does a depth-first search. In order 
to improve computation time, the invention can apply a breadth-first search (corresponding to an application of shared 
functions) in that it "looks at the tree" and finds that "most of the boughs are dead and barren of leaves and fruit", so 
it doesn't bother traversing them. And once it is up in the tree, it keeps avoiding dead branches, and ignoring 
grafted-on branches from orange trees. 

Computer code examples 

[0087] The three procedures in the attachments outline in words examples of computer code that could be used to 
implement the present invention. The first, Procedure Diagnose of Attachment (a), gives the method for computing the 
likelihood of component failures using the speed improvements described in a) and b) of the section 'Improving 
Computation Time'. Procedure evalDiagnosis of Attachment (b) and Procedure findSfPEsc of Attachment (c) are used by 
the Procedure Diagnose. 
Attachment (a): Procedure Diagnose 

Parameters: 

• model 

• passed tests array 

• failed tests array 

Produces 

• list of possible diagnoses, sorted in descending order of likelihood 

1. Generate provisionalActiveSFs, an array of integers. This contains, in 
ascending order, the identification number of each shared function which 
is depended on by any test in the passed and failed test arrays. 

2. Create the diagnosis list as an empty set. 

3. for N := 1 ... maximum number of simultaneous faults to be considered 

a. for C (the component set) running through all combinations of N faulty 
components: 

i. Generate activeSFs, an array of integers. This contains the subset of 
provisionalActiveSFs, which depend on any component in C. 

ii. Evaluate the likelihood of C (and only C) being the failed components 
by calling evalDiagnosis, giving it C and activeSFs. 

iii. If the likelihood > 0, then make an entry in the diagnoses list 
containing C and its associated likelihood. 

4. Sort the diagnosis list in order of descending likelihood. 

5. Return the diagnosis list. 
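
The steps above could be sketched in Java roughly as follows. This is a minimal sketch and not the preferred embodiment's code: the dense arrays sfcov[s][c] and sfUsed[t], the Evaluator interface (which stands in for Procedure evalDiagnosis), and all class and method names are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DiagnoseSketch {
    /** One candidate diagnosis: the assumed-bad components and their likelihood. */
    static final class Diagnosis {
        final int[] components;
        final double likelihood;
        Diagnosis(int[] components, double likelihood) {
            this.components = components;
            this.likelihood = likelihood;
        }
    }

    /** Hypothetical stand-in for Procedure evalDiagnosis. */
    interface Evaluator {
        double likelihood(int[] badComponents, int[] activeSFs);
    }

    static List<Diagnosis> diagnose(double[][] sfcov, int[][] sfUsed, int[] passed, int[] failed,
                                    int numComponents, int maxFaults, Evaluator eval) {
        // Step 1: mark every shared function used by a conducted test.
        boolean[] provisional = new boolean[sfcov.length];
        for (int t : passed) for (int s : sfUsed[t]) provisional[s] = true;
        for (int t : failed) for (int s : sfUsed[t]) provisional[s] = true;
        // Steps 2 and 3: evaluate every combination of 1..maxFaults components.
        List<Diagnosis> result = new ArrayList<>();
        for (int n = 1; n <= maxFaults && n <= numComponents; n++) {
            int[] c = new int[n];
            for (int i = 0; i < n; i++) c[i] = i;
            do {
                int[] active = activeSFs(sfcov, provisional, c);
                double l = eval.likelihood(c.clone(), active);
                if (l > 0) result.add(new Diagnosis(c.clone(), l));
            } while (nextCombination(c, numComponents));
        }
        // Steps 4 and 5: sort by descending likelihood and return.
        result.sort((a, b) -> Double.compare(b.likelihood, a.likelihood));
        return result;
    }

    /** Provisional active shared functions that have coverage on some component in c. */
    static int[] activeSFs(double[][] sfcov, boolean[] provisional, int[] c) {
        int[] tmp = new int[sfcov.length];
        int n = 0;
        for (int s = 0; s < sfcov.length; s++) {
            if (!provisional[s]) continue;
            for (int comp : c) {
                if (sfcov[s][comp] > 0.0) { tmp[n++] = s; break; }
            }
        }
        return Arrays.copyOf(tmp, n);
    }

    /** Advances c to the next combination in lexicographic order; returns false when exhausted. */
    static boolean nextCombination(int[] c, int total) {
        int k = c.length;
        int i = k - 1;
        while (i >= 0 && c[i] == total - k + i) i--;
        if (i < 0) return false;
        c[i]++;
        for (int j = i + 1; j < k; j++) c[j] = c[j - 1] + 1;
        return true;
    }
}
```

The combination array is advanced in place, so apart from the result list the sketch keeps only one component set and one active shared function array at a time.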

Attachment (b): Procedure evalDiagnosis 

Parameters 

• model 

• passed tests array 

• failed tests array 

• C, the list of assumed-bad components 

• activeSFs, the array of shared functions which both depend on a 
component in C, and are used by a passed or failed test. 

Produces 

• a number giving the likelihood that the pattern of passed and failed tests 
could have been caused by the failure of all the components in C, and 
no others. 

1. Call the number of bits in an integer memory location b. If activeSFs has 
more than b elements, signal an error. (In the preferred embodiment b=32 
because the preferred embodiment uses the Java language which uses 32 
bit integers.) 

2. Compute Pprior := the prior probability that all components in C fail while all 
others succeed. (This is the product of the individual prior probabilities.) 

3. Compute sfPEscape := an array of numbers with an entry for each shared 
function, giving the probability that the shared function will pass given bad 
components C. (See procedure findSfPEsc.) 

4. Set sumProb := 0. 

5. for sfPattern := 0 to 2^(#active SFs) - 1 (We interpret the integer sfPattern as 
a small array of bits: counting from the right, if bit i is 1, then the i-th active 
shared function fails. Otherwise, the shared function passes. This is actually 
shared function number activeSFs[i].) 

a. Compute PSFPat := the probability of occurrence of this pattern, by 
multiplying together the separate probabilities of the active shared 
functions. Those probabilities are: 

i. 1-sfPEscape[ activeSFs[i] ], if bit i is 1 in sfPattern 
ii. sfPEscape[ activeSFs[i] ], if bit i is 0 in sfPattern 

b. Compute condPofPassed := probability that all passed tests ought to 
have passed, given the bad components C and the failed shared 
functions indicated by sfPattern. 

c. Compute condPofFailed := probability that all failed tests ought to have 
failed, given the bad components C and the failed shared functions 
indicated by sfPattern. 

d. Add PSFPat * condPofPassed * condPofFailed to sumProb. 
6. Return Pprior * sumProb. 
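
A minimal Java sketch of this procedure follows. It rests on several assumptions that are not part of the description: the model is passed as dense arrays (prior[c], cov[t][c], sfcov[s][c], sfprob[s], sfUsed[t]), Pprior is read as the product of the priors of the components in C, the pass probability of a test follows Equation (7), and the active shared function count is limited to 30 so that the pattern fits a signed Java int. All names are hypothetical.

```java
import java.util.Arrays;

final class EvalDiagnosisSketch {
    static double evalDiagnosis(double[] prior, double[][] cov, double[][] sfcov, double[] sfprob,
                                int[][] sfUsed, int[] passed, int[] failed,
                                int[] bad, int[] activeSFs) {
        // Step 1 (stricter than b = 32, to keep 1 << n within a signed int).
        if (activeSFs.length > 30) {
            throw new IllegalArgumentException("too many active shared functions");
        }
        // Step 2 (assumption: product of the priors of the components in C).
        double pPrior = 1.0;
        for (int c : bad) pPrior *= prior[c];
        // Step 3: escape probability of each shared function (cf. findSfPEsc).
        double[] sfPEscape = new double[sfcov.length];
        Arrays.fill(sfPEscape, 1.0);
        for (int s : activeSFs) for (int c : bad) sfPEscape[s] *= 1.0 - sfcov[s][c];
        // Steps 4 and 5: sum over all pass/fail patterns of the active shared functions.
        boolean[] sfFailed = new boolean[sfcov.length];
        double sumProb = 0.0;
        int patterns = 1 << activeSFs.length;
        for (int sfPattern = 0; sfPattern < patterns; sfPattern++) {
            double pSfPat = 1.0;
            for (int i = 0; i < activeSFs.length; i++) {
                boolean fails = ((sfPattern >> i) & 1) == 1;
                sfFailed[activeSFs[i]] = fails;
                pSfPat *= fails ? 1.0 - sfPEscape[activeSFs[i]] : sfPEscape[activeSFs[i]];
            }
            double condPofPassed = 1.0;
            for (int t : passed) condPofPassed *= pPass(cov, sfprob, sfUsed, t, bad, sfFailed);
            double condPofFailed = 1.0;
            for (int t : failed) condPofFailed *= 1.0 - pPass(cov, sfprob, sfUsed, t, bad, sfFailed);
            sumProb += pSfPat * condPofPassed * condPofFailed;
        }
        // Step 6.
        return pPrior * sumProb;
    }

    /** Probability that test t passes, given the bad components and the failed shared functions. */
    private static double pPass(double[][] cov, double[] sfprob, int[][] sfUsed,
                                int t, int[] bad, boolean[] sfFailed) {
        double p = 1.0;
        for (int c : bad) p *= 1.0 - cov[t][c];
        for (int s : sfUsed[t]) if (sfFailed[s]) p *= 1.0 - sfprob[s];
        return p;
    }
}
```

Beyond the arrays that hold the model and the shared function state, the loop carries only a handful of floating point accumulators, which is the property the description relies on for low memory use.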



Attachment (c): Procedure findSfPEsc 

Parameters 

• model 

• C, an array of assumed-bad components 

• activeSFs, an array of active shared functions. 

Produces 

• an array of numbers with as many entries as there are shared functions 
in the model (not just active shared functions), giving the probability that 
the shared function will pass, given bad components C. 

1. sfPEscape := new numeric array with one entry for each shared function 
(not just active, but all of them). 

2. Set each element of sfPEscape to 1. 

3. For s := each element in activeSFs 

a. For c := each component covered by s 
i. If c is in C, then multiply (1-sfcoverage(s, c)) into sfPEscape[s] 

4. Return sfPEscape. 
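
In Java, the procedure might look like the following sketch. It assumes a dense array sfcov[s][c] in which uncovered components have coverage zero, so the loop runs over the assumed-bad components directly instead of over the components covered by s; both forms multiply in the same factors. The class name and array layout are hypothetical.

```java
import java.util.Arrays;

final class FindSfPEscSketch {
    /** Returns, for every shared function, the probability that it passes given bad components. */
    static double[] findSfPEsc(double[][] sfcov, int[] bad, int[] activeSFs) {
        // Steps 1 and 2: one entry per shared function in the model, initialised to 1.
        double[] sfPEscape = new double[sfcov.length];
        Arrays.fill(sfPEscape, 1.0);
        // Step 3: only active shared functions can have an escape probability below 1.
        for (int s : activeSFs) {
            for (int c : bad) {
                sfPEscape[s] *= 1.0 - sfcov[s][c];
            }
        }
        // Step 4.
        return sfPEscape;
    }
}
```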
Claims 

1. A diagnosis engine for diagnosing a device with a plurality of components, wherein: 

states of the components are assumed to be probabilistically independent, for computing the probability of 
any particular set of components being bad and all others being good, 

states of shared functions, applicable for testing the functionality of some components in the same way, are 
assumed to be probabilistically independent given component states, for computing the probability of any 
particular set of shared functions being failed and another particular set of shared functions being passed 
given that a particular set of components are bad, and 

states of tests applicable on the device are assumed to be probabilistically independent given component and 
shared function states, for computing the probability of any particular set of tests being failed and another 
particular set of shared functions being passed given that a particular set of components are bad, and the rest 
are good, and a particular set of shared functions are failed, and the rest are passed; 

the diagnosis engine receiving: 

test results of a set of tests on the device where at least one test has failed, and 

a model giving the coverage of the tests on the components of the device and information describing proba- 
bilistic dependencies between the tests; and 

the diagnosis engine comprising: 

means for setting or specifying a number N of components which may be simultaneously bad, and 

computing means for computing the likelihood that each of subsets of the components with size less than or 
equal to N are the bad components, whereby the computation is substantially exact within floating point com- 
putation errors. 

2. A diagnosis engine for diagnosing a device with a plurality of components, wherein: 

states of the components are assumed to be probabilistically independent, for computing the probability of 
any particular set of components being bad and all others being good, 

states of shared functions, applicable for testing the functionality of some components in the same way, are 
assumed to be probabilistically independent given component states, for computing the probability of any 
particular set of shared functions being failed and another particular set of shared functions being passed 
given that a particular set of components are bad, and 

states of tests applicable on the device are assumed to be probabilistically independent given component and 
shared function states, for computing the probability of any particular set of tests being failed and another 
particular set of shared functions being passed given that a particular set of components are bad, and the rest 
are good, and a particular set of shared functions are failed, and the rest are passed; 

the diagnosis engine comprising: 

means for specifying the component prior probabilities, coverages, and shared function coverages, 

means for specifying which tests have passed or failed or which were not performed, 

means for setting or specifying a number N of components which may be simultaneously bad, and 

computing means for computing the likelihood that each of subsets of the components with size less than or 
equal to N are the bad components, whereby the computation is substantially exact within floating point com- 
putation errors. 
3. The diagnosis engine of claim 1 or 2, further comprising: 

means of outputting for one or more of the components the likelihood that the component is bad. 

4. The diagnosis engine of claim 1 or 2, further comprising: 

8. The diagnosis engine of claim 7, comprising: 

a floating point register or memory location to accumulate the sum in Equation (17), 
a floating point register or memory location to accumulate the products in Equation (18), 
two floating point registers or memory locations to accumulate the products in Equation (20), and 
one floating point register or memory location to accumulate the products in Equation (21). 

11. A method for diagnosing a device with a plurality of components, whereby: 

the method comprising the steps of: 

receiving test results of a set of tests on the device where at least one test has failed, and 

setting or specifying a number N of components which may be simultaneously bad, and 

computing the likelihood that each of subsets of the components with size less than or equal to N are the bad 
components, whereby the computation is substantially exact within floating point computation errors. 

12. A software product, preferably stored on a data carrier, for executing the method of claim 10 when run on a data 
processing system such as a computer. 

13. A software program, adapted to be stored on or otherwise provided by any kind of data carrier, for executing the 
steps of the method of claim 10 when run in or by any suitable data processing unit. 

14. A diagnosis engine for diagnosing a device with a plurality of components, wherein: 

states of the components are assumed to be probabilistically independent, for computing the probability of 
any particular set of components being bad and all others being good, and 

states of tests applicable on the device are assumed to be probabilistically independent given component 
states, for computing the probability of any particular set of tests being failed given that a particular set of 
components are bad, and the rest are good; 

the diagnosis engine receiving: 

test results of a set of tests on the device where at least one test has failed, and 

a model giving the coverage of the tests on the components of the device and information describing proba- 
bilistic dependencies between the tests; and 

the diagnosis engine comprising: 

means for setting or specifying a number N of components which may be simultaneously bad, and 

computing means for computing the likelihood that each of subsets of the components with size less than or 
equal to N are the bad components, whereby the computation is substantially exact within floating point com- 
putation errors. 
Table 1: Summary of notation 

Table 2 
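$$P(S \subseteq \Omega \text{ fail} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}) = \prod_{s \in S} P(s \text{ fail} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}) \tag{1}$$

$$P(s \in \Omega \text{ fail} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}) = 1 - \prod_{c \in C} (1 - \text{sfcov}(s, c)) \tag{2}$$

$$= 1 - \prod_{\substack{c \in C \\ \text{sfcov}(s, c) \neq 0}} (1 - \text{sfcov}(s, c)) \tag{3}$$

$$P(T \subseteq \Psi \text{ fail} \mid C \text{ bad}, \bar{C} \text{ good}, S \text{ fail}, \bar{S} \text{ pass}) = \prod_{t \in T} P(t \text{ fail} \mid C, \bar{C}, S, \bar{S}) \tag{4}$$

$$P(t \in \Psi \text{ fail} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}) = 1 - \prod_{c \in C} (1 - \text{cov}(t, c)) \tag{5}$$

$$= 1 - \prod_{\substack{c \in C \\ \text{cov}(t, c) \neq 0}} (1 - \text{cov}(t, c)) \tag{6}$$

$$P(t \in \Psi \text{ passed} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}, S \subseteq \Omega \text{ failed}, \bar{S} \text{ passed}) = \prod_{c \in C} (1 - \text{cov}(t, c)) \prod_{s \in \text{sfused}(t)} \bigl(1 - \text{sfprob}(s)\, P(s \text{ failed} \mid C, \bar{C}, S, \bar{S})\bigr) \tag{7}$$

$$P(t \in \Psi \text{ failed} \mid C \subseteq \Phi \text{ bad}, \bar{C} \text{ good}, S \subseteq \Omega \text{ failed}, \bar{S} \text{ passed}) = 1 - \prod_{c \in C} (1 - \text{cov}(t, c)) \prod_{s \in \text{sfused}(t)} \bigl(1 - \text{sfprob}(s)\, P(s \text{ failed} \mid C, \bar{C}, S, \bar{S})\bigr) \tag{8}$$

$$P(C \text{ bad}, \bar{C} \text{ good} \mid \pi \text{ pass}, \phi \text{ fail}) = \frac{P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C})\, P(C, \bar{C})}{P(\pi \text{ pass}, \phi \text{ fail})} \tag{9}$$

$$L(C \text{ bad}, \bar{C} \text{ good} \mid \pi \text{ pass}, \phi \text{ fail}) = P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C})\, P(C, \bar{C}) \tag{10}$$

$$P(\pi, \phi \mid C, \bar{C}) = P(\pi \text{ pass} \mid C, \bar{C})\, P(\phi \text{ fail} \mid C, \bar{C}) \tag{11}$$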
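$$P(\pi \text{ pass} \mid C, \bar{C}) = \prod_{t \in \pi} P(t \text{ pass} \mid C, \bar{C}) \tag{12}$$

$$= \prod_{t \in \pi} \prod_{c \in C} P(t \text{ pass} \mid c) \tag{13}$$

$$= \prod_{t \in \pi} \prod_{c \in C} (1 - \text{cov}(t, c)) \tag{14}$$

$$P(\phi \text{ fail} \mid C, \bar{C}) = \prod_{t \in \phi} P(t \text{ failed} \mid C, \bar{C}) \tag{15}$$

$$= \prod_{t \in \phi} \Bigl(1 - \prod_{c \in C} (1 - \text{cov}(t, c))\Bigr) \tag{16}$$

$$P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C}) = \sum_{(\sigma \text{ failed}, \bar{\sigma} \text{ passed}) \in \mathcal{P}(\Omega)} P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C}, \sigma, \bar{\sigma})\, P(\sigma \text{ fail}, \bar{\sigma} \text{ pass} \mid C, \bar{C}) \tag{17}$$

$$P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C}, \sigma, \bar{\sigma}) = \prod_{t \in \pi} P(t \text{ pass} \mid C, \bar{C}, \sigma, \bar{\sigma}) \prod_{t \in \phi} P(t \text{ fail} \mid C, \bar{C}, \sigma, \bar{\sigma}) \tag{18}$$

$$P(\sigma \text{ fail}, \bar{\sigma} \text{ pass} \mid C, \bar{C}) = P(\sigma \text{ fail} \mid C, \bar{C})\, P(\bar{\sigma} \text{ pass} \mid C, \bar{C}) \tag{19}$$

$$P(\sigma \text{ fail} \mid C, \bar{C}) = \prod_{s \in \sigma} \Bigl(1 - \prod_{c \in C} (1 - \text{sfcov}(s, c))\Bigr) \tag{20}$$

$$P(\bar{\sigma} \text{ pass} \mid C, \bar{C}) = \prod_{s \in \bar{\sigma}} \prod_{c \in C} (1 - \text{sfcov}(s, c)) \tag{21}$$

$$\hat{\Omega} = \{\, s \in \tilde{\Omega} \mid \exists\, c \in C : \text{sfcov}(s, c) > 0 \,\} \tag{22}$$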
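$$P(C \text{ bad}, C \subseteq \Phi) = \prod_{c \in C} P(c \text{ bad}) \tag{0.1}$$

$$P(s \in \Omega \text{ failed} \mid c \in \Phi \text{ bad}) = \text{sfcov}(s, c) \tag{0.2}$$

$$P(t \in \Psi \text{ failed} \mid c \in \Phi \text{ bad}) = \text{cov}(t, c)$$

$$\text{sfprob}(s) = 1 - \text{sfvar}(s)/2$$

$$(\sigma \text{ failed}, \bar{\sigma} \text{ passed}) \in \mathcal{P}(\hat{\Omega}) \tag{21.1}$$

$$\tilde{\Omega} = \bigcup_{t \in (\pi \cup \phi)} \text{sfused}(t) \tag{22.1}$$

$$P(\pi \text{ pass}, \phi \text{ fail} \mid C, \bar{C}, \sigma, \bar{\sigma}) \tag{22.2}$$

$$P(\sigma \text{ failed}, \bar{\sigma} \text{ passed} \mid C, \bar{C}) \tag{22.3}$$

$$P(\sigma \text{ fail} \mid C, \bar{C}) = 0 \tag{22.4}$$

$$\prod_{c \in C} (1 - \text{cov}(t, c)) \tag{22.5}$$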